128 research outputs found

    Productivity differences between and within countries

    July 6, 200

    LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models

    Linking information across sources is fundamental to a variety of analyses in social science, business, and government. While large language models (LLMs) offer enormous promise for improving record linkage in noisy datasets, in many domains approximate string matching packages in popular software such as R and Stata remain predominant. These packages have clean, simple interfaces and can be easily extended to a diversity of languages. Our open-source package LinkTransformer aims to extend the familiarity and ease-of-use of popular string matching methods to deep learning. It is a general purpose package for record linkage with transformer LLMs that treats record linkage as a text retrieval problem. At its core is an off-the-shelf toolkit for applying transformer models to record linkage with four lines of code. LinkTransformer contains a rich repository of pre-trained transformer semantic similarity models for multiple languages and supports easy integration of any transformer language model from Hugging Face or OpenAI. It supports standard functionality such as blocking and linking on multiple noisy fields. LinkTransformer APIs also perform other common text data processing tasks, e.g., aggregation, noisy de-duplication, and translation-free cross-lingual linkage. Importantly, LinkTransformer also contains comprehensive tools for efficient model tuning, to facilitate different levels of customization when off-the-shelf models do not provide the required accuracy. Finally, to promote reusability, reproducibility, and extensibility, LinkTransformer makes it easy for users to contribute their custom-trained models to its model hub. By combining transformer language models with intuitive APIs that will be familiar to many users of popular string matching packages, LinkTransformer aims to democratize the benefits of LLMs among those who may be less familiar with deep learning frameworks.
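The idea of treating record linkage as a text retrieval problem can be illustrated with a minimal sketch. This is not the LinkTransformer API: the functions `embed`, `cosine`, and `link` are hypothetical names, and a toy character trigram vector stands in for the transformer encoder the abstract describes. Only the retrieval step (embed both sides, return the nearest neighbor) reflects the approach named above.

```python
from collections import Counter
from math import sqrt


def embed(text: str, n: int = 3) -> Counter:
    """Toy stand-in for a transformer encoder: character n-gram counts (assumption)."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def link(queries, corpus):
    """For each query record, retrieve the most similar corpus record."""
    corpus_vecs = [(record, embed(record)) for record in corpus]
    matches = {}
    for query in queries:
        q_vec = embed(query)
        best_record, _ = max(corpus_vecs, key=lambda pair: cosine(q_vec, pair[1]))
        matches[query] = best_record
    return matches


# Hypothetical firm names, invented for illustration only.
left = ["Acme Corp.", "Globex Incorporated"]
right = ["ACME Corporation", "Globex Inc", "Initech LLC"]
matches = link(left, right)
```

Swapping `embed` for a real sentence encoder would leave `link` unchanged, which is the practical appeal of the retrieval framing.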

    A Massive Scale Semantic Similarity Dataset of Historical English

    A diversity of tasks use language models trained on semantic similarity data. While there are a variety of datasets that capture semantic similarity, they are either constructed from modern web data or are relatively small datasets created in the past decade by human annotators. This study utilizes a novel source, newly digitized articles from off-copyright, local U.S. newspapers, to assemble a massive-scale semantic similarity dataset spanning 70 years from 1920 to 1989 and containing nearly 400M positive semantic similarity pairs. Historically, around half of articles in U.S. local newspapers came from newswires like the Associated Press. While local papers reproduced articles from the newswire, they wrote their own headlines, which form abstractive summaries of the associated articles. We associate articles and their headlines by exploiting document layouts and language understanding. We then use deep neural methods to detect which articles are from the same underlying source, in the presence of substantial noise and abridgement. The headlines of reproduced articles form positive semantic similarity pairs. The resulting publicly available HEADLINES dataset is significantly larger than most existing semantic similarity datasets and covers a much longer span of time. It will facilitate the application of contrastively trained semantic similarity models to a variety of tasks, including the study of semantic change across space and time.

    Productivity Differences Between and Within Countries

    We document substantial within-country (cross-municipality) differences in incomes for a large number of countries in the Americas. A significant fraction of the within-country differences cannot be explained by observed human capital. We conjecture that the sources of within-country and between-country differences are related. As a first step towards a united framework, we propose a simple model incorporating both differences in technological know-how across countries and differences in productive efficiency within countries.

    Essays in economic development and political economy

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Economics, 2012. Cataloged from PDF version of thesis. Includes bibliographical references (p. 183-197). This thesis examines three topics. The first chapter, entitled "Persistent Effects of Peru's Mining Mita" utilizes regression discontinuity to examine the long-run impacts of the mita, an extensive forced mining labor system in effect in Peru and Bolivia between 1573 and 1812. Results indicate that a mita effect lowers household consumption by around 25% and increases the prevalence of stunted growth in children by around six percentage points in subjected districts today. Using data from the Spanish Empire and Peruvian Republic to trace channels of institutional persistence, I show that the mita's influence has persisted through its impacts on land tenure and public goods provision. Mita districts historically had fewer large landowners and lower educational attainment. Today, they are less integrated into road networks, and their residents are substantially more likely to be subsistence farmers. The second chapter, entitled "Trafficking Networks and the Mexican Drug War" examines how drug traffickers' economic objectives influence the direct and spillover effects of Mexican policy towards the drug trade. Drug trade-related violence has escalated dramatically in Mexico during the past five years, claiming over 40,000 lives. By exploiting variation from close mayoral elections and a network model of drug trafficking, the study develops three sets of results. First, regression discontinuity estimates show that drug trade-related violence in a municipality increases substantially after the close election of a mayor from the conservative National Action Party (PAN), which has spearheaded the war on drug trafficking. This violence consists primarily of individuals involved in the drug trade killing each other. 
The empirical evidence suggests that the violence reflects rival traffickers' attempts to wrest control of territories after crackdowns initiated by PAN mayors have challenged the incumbent criminals. Second, the study predicts the diversion of drug traffic following close PAN victories by estimating a model of equilibrium routes for trafficking drugs across the Mexican road network to the U.S. When drug traffic is diverted to other municipalities, drug trade-related violence in these municipalities increases. Moreover, female labor force participation and informal sector wages fall, corroborating qualitative evidence that traffickers extort informal sector producers. Finally, the study uses the trafficking model and estimated spillover effects to examine the allocation of law enforcement resources. Overall, the results demonstrate how traffickers' economic objectives and constraints imposed by the routes network affect the policy outcomes of the Mexican Drug War. The third chapter, entitled "Insurgency and Long-Run Development: Lessons from the Mexican Revolution" exploits within-state variation in drought severity to identify how insurgency during the Mexican Revolution, a major early 20th century armed conflict, impacted subsequent government policies and long-run economic development. Using a novel municipal-level dataset on revolutionary insurgency, the study documents that municipalities experiencing severe drought just prior to the Revolution were substantially more likely to have insurgent activity than municipalities where drought was less severe. Many insurgents demanded land reform, and following the Revolution, Mexico redistributed over half of its surface area in the form of ejidos: farms comprised of individual and communal plots that were granted to a group of petitioners. Rights to ejido plots were non-transferable, renting plots was prohibited, and many decisions about the use of ejido lands had to be countersigned by politicians. 
Instrumental variables estimates show that municipalities with revolutionary insurgency had 22 percentage points more of their surface area redistributed as ejidos. Today, insurgent municipalities are 20 percentage points more agricultural and 6 percentage points less industrial. Incomes in insurgent municipalities are lower and alternations between political parties for the mayorship have been substantially less common. Overall, the results support the hypothesis that land reform, while successful at placating insurgent regions, stymied long-run economic development. By Melissa Dell. Ph.D.

    What Do We Learn from the Weather? The New Climate-Economy Literature

    A rapidly growing body of research applies panel methods to examine how temperature, precipitation, and windstorms influence economic outcomes. These studies focus on changes in weather realizations over time within a given spatial area and demonstrate impacts on agricultural output, industrial output, labor productivity, energy demand, health, conflict, and economic growth, among other outcomes. By harnessing exogenous variation over time within a given spatial unit, these studies help credibly identify (i) the breadth of channels linking weather and the economy, (ii) heterogeneous treatment effects across different types of locations, and (iii) nonlinear effects of weather variables. This paper reviews the new literature with two purposes. First, we summarize recent work, providing a guide to its methodologies, datasets, and findings. Second, we consider applications of the new literature, including insights for the "damage function" within models that seek to assess the potential economic effects of future climate change.

    Noise-Robust De-Duplication at Scale

    Identifying near duplicates within large, noisy text corpora has a myriad of applications that range from de-duplicating training datasets, reducing privacy risk, and evaluating test set leakage, to identifying reproduced news articles and literature within large corpora. Across these diverse applications, the overwhelming majority of work relies on N-grams. Limited efforts have been made to evaluate how well N-gram methods perform, in part because it is unclear how one could create an unbiased evaluation dataset for a massive corpus. This study uses the unique timeliness of historical news wires to create a 27,210 document dataset, with 122,876 positive duplicate pairs, for studying noise-robust de-duplication. The time-sensitivity of news makes comprehensive hand labelling feasible - despite the massive overall size of the corpus - as duplicates occur within a narrow date range. The study then develops and evaluates a range of de-duplication methods: hashing and N-gram overlap (which predominate in the literature), a contrastively trained bi-encoder, and a re-rank style approach combining a bi- and cross-encoder. The neural approaches significantly outperform hashing and N-gram overlap. We show that the bi-encoder scales well, de-duplicating a 10 million article corpus on a single GPU card in a matter of hours. The public release of our NEWS-COPY de-duplication dataset will facilitate further research and applications.
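The N-gram overlap baseline mentioned above can be sketched as Jaccard similarity over word trigrams. This is a generic illustration, not the study's implementation: the function names, the trigram size, and the 0.5 threshold are assumptions chosen for the example, and the pairwise loop is quadratic, so it would not scale to the corpus sizes discussed in the abstract.

```python
def word_ngrams(text, n=3):
    """Return the set of word n-grams in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(docs, n=3, threshold=0.5):
    """Compare every pair of documents and keep those at or above the threshold."""
    grams = [word_ngrams(doc, n) for doc in docs]
    pairs = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if jaccard(grams[i], grams[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Invented example texts: the second drops one word, as a noisy reprint might.
docs = [
    "the senate passed the farm bill on tuesday after a long debate",
    "the senate passed the farm bill on tuesday after long debate",
    "local team wins the state championship in overtime thriller",
]
pairs = near_duplicate_pairs(docs)
```

A single dropped word removes three of the ten trigrams in the first text, which hints at why overlap measures degrade quickly under OCR noise and abridgement.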

    Linking Representations with Multimodal Contrastive Learning

    Many applications require grouping instances contained in diverse document datasets into classes. Most widely used methods do not employ deep learning and do not exploit the inherently multimodal nature of documents. Notably, record linkage is typically conceptualized as a string-matching problem. This study develops CLIPPINGS (Contrastively Linking Pooled Pre-trained Embeddings), a multimodal framework for record linkage. CLIPPINGS employs end-to-end training of symmetric vision and language bi-encoders, aligned through contrastive language-image pre-training, to learn a metric space where the pooled image-text representation for a given instance is close to representations in the same class and distant from representations in different classes. At inference time, instances can be linked by retrieving their nearest neighbor from an offline exemplar embedding index or by clustering their representations. The study examines two challenging applications: constructing comprehensive supply chains for mid-20th century Japan through linking firm level financial records - with each firm name represented by its crop in the document image and the corresponding OCR - and detecting which image-caption pairs in a massive corpus of historical U.S. newspapers came from the same underlying photo wire source. CLIPPINGS outperforms widely used string matching methods by a wide margin and also outperforms unimodal methods. Moreover, a purely self-supervised model trained on only image-OCR pairs also outperforms popular string-matching methods without requiring any labels.
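The inference step described above, retrieving the nearest neighbor of a pooled image-text representation from an exemplar index, can be sketched as follows. This is an illustrative assumption rather than the CLIPPINGS code: pooling is taken here to be the mean of the L2-normalized image and text vectors, the function names are hypothetical, and the three-dimensional vectors are made up in place of real encoder outputs.

```python
from math import sqrt


def normalize(vec):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def pool(image_vec, text_vec):
    """Assumed pooling: average the normalized image and text vectors, then renormalize."""
    img, txt = normalize(image_vec), normalize(text_vec)
    return normalize([(a + b) / 2 for a, b in zip(img, txt)])


def nearest(query, index):
    """Return the label whose exemplar has the highest cosine similarity to the query."""
    return max(index, key=lambda label: sum(q * e for q, e in zip(query, index[label])))


# Offline exemplar index built from made-up embeddings for two hypothetical firms.
index = {
    "firm_a": pool([1.0, 0.0, 0.0], [0.9, 0.1, 0.0]),
    "firm_b": pool([0.0, 1.0, 0.0], [0.1, 0.9, 0.0]),
}

# A new instance: its image crop and its OCR text are each embedded, then pooled.
query = pool([0.8, 0.2, 0.0], [1.0, 0.0, 0.1])
label = nearest(query, index)
```

Because every pooled vector is unit length, the dot product in `nearest` is the cosine similarity, so the same index could also feed a clustering step.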

    Temperature and Income: Reconciling New Cross-Sectional and Panel Estimates

    This paper presents novel evidence and analysis of the relationship between temperature and income. First, using sub-national data from 12 countries in the Americas, we provide new evidence that the negative cross-country relationship between temperature and income also exists within countries and even within states. Second, we provide a theoretical framework for reconciling the substantial, negative association between temperature and income in the cross-section with the even stronger short-run effects of temperature estimated by panel models. The theoretical framework suggests that half of the negative short-term effects of temperature may be offset in the long run through adaptation.